Here we will work with fungal ITS amplicon data processed using similar methods to those described in the first part of the workshop. These data were obtained from soil and root samples collected from plots receiving different fertiliser treatments. We also have additional metadata for these samples, in the form of several chemical and biological parameters, that we will link to the amplicon data and use in our analyses.
Here, I will use base-R and tidyverse functions to show you how you can do this. There are specialised packages for, for example, reading and manipulating data in biom format, but it is simple enough to use approaches that are generally applicable to reading and manipulating data. I prefer to show it this way because it is very flexible and emphasises the importance of having a well-structured and organised workflow. The example used here demonstrates the importance of this flexibility since the sequencing data and metadata were collected at different hierarchical levels. However, we will use some convenient functions from the phyloseq libraries to inspect and subset/prune the data.
Most of the libraries used here can be installed from CRAN using, for example:
install.packages('vegan')
Some packages are only available on github, these can be installed as follows:
install.packages('devtools')
devtools::install_github('phyloseq/joey711')
devtools::install_github('cpauvert/psadd')
devtools::install_github('jfq3/QsRutils')
devtools::install_github('jfq3/ggordiplots')
devtools::install_github('pmartinezarbizu/pairwiseAdonis/pairwiseAdonis')
Our OTU data are stored in a tab-delimited text file, generated using biom convert on a .biom file. The first few rows are shown in the screengrab below.
The first row contains information that is not useful for us, so we will skip reading it.
# read OTU data, calling this 'dat'
# use the 'skip' argument to not read in the first row
# use the 'stringsAsFactors' argument to read OTU names as character
dat <- read.delim('data/nutnet_ITS.txt', skip=1, stringsAsFactors=F)
# inspect the structure of the data
str(dat)
## 'data.frame': 3504 obs. of 33 variables:
## $ X.OTU.ID : chr "ITSall_OTUa_524" "ITSall_OTUa_78" "ITSall_OTUa_8945" "ITSall_OTUc_1" ...
## $ barcodelabel.JP10.: int 4990 803 59 2971 1262 441 1532 167 1 107 ...
## $ barcodelabel.JP11.: int 176 487 37 8992 282 721 3307 561 59 42 ...
## $ barcodelabel.JP12.: int 30 18 26 116 80 235 158 47 18 31 ...
## $ barcodelabel.JP13.: int 89 50 51 1553 109 289 2087 578 17 56 ...
## $ barcodelabel.JP14.: int 11 0 371 1781 894 1451 5140 398 9 672 ...
## $ barcodelabel.JP15.: int 33 398 9 1678 333 1010 2455 532 214 159 ...
## $ barcodelabel.JP16.: int 405 113 79 2482 1197 325 2406 155 872 8302 ...
## $ barcodelabel.JP17.: int 10358 7662 1 158 1130 1347 843 3643 3885 8 ...
## $ barcodelabel.JP18.: int 10611 2001 323 42 262 1910 133 705 4008 917 ...
## $ barcodelabel.JP19.: int 112 188 500 112 990 462 230 152 210 308 ...
## $ barcodelabel.JP1. : int 170 7 2 91 2 35 71 9 5 3 ...
## $ barcodelabel.JP20.: int 499 2429 3081 382 2943 5285 2529 6200 3498 373 ...
## $ barcodelabel.JP21.: int 12338 7136 8152 721 1346 6263 1849 1763 2123 8 ...
## $ barcodelabel.JP22.: int 5359 5982 10495 139 3812 2398 923 490 6222 2232 ...
## $ barcodelabel.JP23.: int 4598 9549 9407 114 1627 3099 419 1072 464 280 ...
## $ barcodelabel.JP24.: int 15763 1034 581 142 3080 2609 387 464 785 210 ...
## $ barcodelabel.JP25.: int 7941 5045 1647 469 4497 1631 899 495 113 399 ...
## $ barcodelabel.JP26.: int 7059 8749 2077 240 7373 829 192 1609 133 34 ...
## $ barcodelabel.JP27.: int 10335 7932 4564 440 4093 4114 320 3733 720 2054 ...
## $ barcodelabel.JP28.: int 1056 2686 5734 116 5478 2293 85 503 361 9062 ...
## $ barcodelabel.JP29.: int 745 2204 441 212 374 2140 590 8724 70 0 ...
## $ barcodelabel.JP2. : int 10 1091 37 2040 0 114 1051 1 1477 110 ...
## $ barcodelabel.JP30.: int 1578 784 8096 283 1624 1534 805 1137 137 207 ...
## $ barcodelabel.JP31.: int 55 40 176 3 142 189 13 120 23 6 ...
## $ barcodelabel.JP32.: int 4849 392 4820 205 2977 1099 243 1292 558 360 ...
## $ barcodelabel.JP3. : int 0 1097 57 4124 38 858 1368 32 152 657 ...
## $ barcodelabel.JP4. : int 683 594 79 4521 54 512 3353 840 2989 102 ...
## $ barcodelabel.JP5. : int 1172 101 71 5285 412 628 3077 10 804 0 ...
## $ barcodelabel.JP6. : int 20 349 626 4882 172 904 1911 500 785 618 ...
## $ barcodelabel.JP7. : int 33 248 1097 2828 75 833 2387 37 2 260 ...
## $ barcodelabel.JP8. : int 2701 329 459 3575 1471 955 2877 571 127 173 ...
## $ barcodelabel.JP9. : int 44 19 35 407 0 137 303 118 17 66 ...
First we will tidy up the names of our columns, since there is extra text in the sample names introduced during sequence processing (these samples were labelled JP# when submitted for sequencing). Then we will transpose the data so that samples are in rows and OTUs are in columns.
# remove the extra text around the sample labels ('JP#') using 'gsub'
# arguments are 1: pattern to find, 2: replacement, 3: where to look
# 'fixed=T' is needed because, without it, '.' is interpreted as any symbol
names(dat) <- gsub('barcodelabel.', '', names(dat), fixed=T)
names(dat)
## [1] "X.OTU.ID" "JP10." "JP11." "JP12." "JP13." "JP14."
## [7] "JP15." "JP16." "JP17." "JP18." "JP19." "JP1."
## [13] "JP20." "JP21." "JP22." "JP23." "JP24." "JP25."
## [19] "JP26." "JP27." "JP28." "JP29." "JP2." "JP30."
## [25] "JP31." "JP32." "JP3." "JP4." "JP5." "JP6."
## [31] "JP7." "JP8." "JP9."
names(dat) <- gsub('.', '', names(dat), fixed=T)
names(dat)
## [1] "XOTUID" "JP10" "JP11" "JP12" "JP13" "JP14" "JP15"
## [8] "JP16" "JP17" "JP18" "JP19" "JP1" "JP20" "JP21"
## [15] "JP22" "JP23" "JP24" "JP25" "JP26" "JP27" "JP28"
## [22] "JP29" "JP2" "JP30" "JP31" "JP32" "JP3" "JP4"
## [29] "JP5" "JP6" "JP7" "JP8" "JP9"
# load the 'dplyr' package to use functions for data manipulation
library(dplyr)
# extract the read counts into a separate dataframe, calling this 'mat'
# use the 'select' function from the 'dplyr' package
# first argument is the dataframe to use, others refer to columns to keep/lose
mat <- select(dat, -XOTUID)
# convert 'mat' to a matrix object and look at the object structure
mat <- as.matrix(mat)
str(mat)
## int [1:3504, 1:32] 4990 803 59 2971 1262 441 1532 167 1 107 ...
## - attr(*, "dimnames")=List of 2
## ..$ : NULL
## ..$ : chr [1:32] "JP10" "JP11" "JP12" "JP13" ...
# use the column of OTU names in 'dat' as row names in 'mat'
# this is okay to do because we have not changed the order of rows or columns in either object
# the '$' indicates that we want to use the column in 'dat' that is named 'XOTUID'
rownames(mat) <- dat$XOTUID
str(mat)
## int [1:3504, 1:32] 4990 803 59 2971 1262 441 1532 167 1 107 ...
## - attr(*, "dimnames")=List of 2
## ..$ : chr [1:3504] "ITSall_OTUa_524" "ITSall_OTUa_78" "ITSall_OTUa_8945" "ITSall_OTUc_1" ...
## ..$ : chr [1:32] "JP10" "JP11" "JP12" "JP13" ...
# transpose the matrix using the 't' function (samples -> rows, OTUs -> columns)
mat <- t(mat)
str(mat)
## int [1:32, 1:3504] 4990 176 30 89 11 33 405 10358 10611 112 ...
## - attr(*, "dimnames")=List of 2
## ..$ : chr [1:32] "JP10" "JP11" "JP12" "JP13" ...
## ..$ : chr [1:3504] "ITSall_OTUa_524" "ITSall_OTUa_78" "ITSall_OTUa_8945" "ITSall_OTUc_1" ...
# bind the OTU counts in 'mat' to a named column of sample IDs (the rownames of 'mat',
# name this column 'SampleID') and, in the process, replace 'dat' with a new dataframe object
# use the 'stringsAsFactors' argument to treat OTU names as character
#
dat <- data.frame(SampleID=rownames(mat), mat, stringsAsFactors=F)
# don't look at the structure of 'dat' since it contains a few thousand columns
# but you can check the dimensions using the 'dim' function (returns numbers of
# rows and columns) to make sure that it has only one more column than 'mat'
dim(mat)
## [1] 32 3504
dim(dat)
## [1] 32 3505
# clear 'mat' from the workspace
rm(mat)
The metadata are stored in three .csv files. The first contains the information regarding the sampled material (root or soil) and the plot IDs indicating the plots from which those samples were collected.
The second contains the information about what treatments were applied to the plots and in which block each plot was found.
The third contains several soil chemical and enzymatic measurements that were taken on the same soil samples.
# in each case, use the 'stringsAsFactors' argument to treat OTU names as character
# we can change some variables to factors later, this is easier
# read in sample information
samps <- read.csv('data/nutnet_samples.csv', stringsAsFactors=F)
str(samps)
## 'data.frame': 32 obs. of 3 variables:
## $ SampleID : chr "JP1" "JP2" "JP3" "JP4" ...
## $ SampleType: chr "soil" "soil" "soil" "soil" ...
## $ PlotID : int 44 45 40 42 36 29 33 22 32 5 ...
# 'SampleType' can be converted to a factor variable
samps$SampleType <- factor(samps$SampleType)
str(samps)
## 'data.frame': 32 obs. of 3 variables:
## $ SampleID : chr "JP1" "JP2" "JP3" "JP4" ...
## $ SampleType: Factor w/ 2 levels "root","soil": 2 2 2 2 2 2 2 2 2 2 ...
## $ PlotID : int 44 45 40 42 36 29 33 22 32 5 ...
summary(samps)
## SampleID SampleType PlotID
## Length:32 root:16 Min. : 3.00
## Class :character soil:16 1st Qu.:14.00
## Mode :character Median :25.50
## Mean :24.88
## 3rd Qu.:37.00
## Max. :45.00
# read in plot information
plots <- read.csv('data/nutnet_plots.csv', stringsAsFactors=F)
str(plots)
## 'data.frame': 16 obs. of 3 variables:
## $ BlockID : int 4 4 4 4 3 3 3 3 2 1 ...
## $ PlotID : int 44 45 40 42 36 29 33 22 32 5 ...
## $ Treatment: chr "NP" "control" "P" "N" ...
# 'BlockID' and 'Treatment' can be converted to factor variables
# use the 'levels' argument to control the ordering of treatments (default is alphabetical)
# don't convert 'PlotID' to a factor variable yet
plots$BlockID <- factor(plots$BlockID)
plots$Treatment <- factor(plots$Treatment, levels=c('control', 'N', 'P', 'NP'))
str(plots)
## 'data.frame': 16 obs. of 3 variables:
## $ BlockID : Factor w/ 4 levels "1","2","3","4": 4 4 4 4 3 3 3 3 2 1 ...
## $ PlotID : int 44 45 40 42 36 29 33 22 32 5 ...
## $ Treatment: Factor w/ 4 levels "control","N",..: 4 1 3 2 1 2 3 3 4 3 ...
summary(plots)
## BlockID PlotID Treatment
## 1:4 Min. : 3.00 control:4
## 2:4 1st Qu.:14.00 N :4
## 3:4 Median :25.50 P :4
## 4:4 Mean :24.88 NP :4
## 3rd Qu.:37.00
## Max. :45.00
# read in additional soil data
soils <- read.csv('data/nutnet_soildata.csv', stringsAsFactors=F)
str(soils)
## 'data.frame': 16 obs. of 19 variables:
## $ PlotID : int 3 4 5 8 16 18 21 22 29 32 ...
## $ Microbial_Biomass_C: num 2.16 2.44 1.89 2.74 2.12 ...
## $ Microbial_N : num 0.092 0.118 0.123 0.135 0.142 0.106 0.157 0.126 0.156 0.164 ...
## $ Microbial_P : num 0.18 0.061 0.331 0.065 0.169 0.053 0.062 0.184 0.198 0.061 ...
## $ Total_C_mg_g_soil : num 119.8 401.5 56.9 422.6 125.9 ...
## $ Total_N_mg_g_soil : num 5.09 19.38 3.7 20.87 8.44 ...
## $ Total_P_mg_g_soil : num 2508 4228 2848 2943 5718 ...
## $ X..Moisture : num 15.9 16.7 15.9 15.7 14.5 ...
## $ DOC_Soil : num 0.208 0.083 0.305 0.081 0.267 0.2 0.123 0.247 0.328 0.232 ...
## $ Extractable_N_Soil : num 0.007 0.011 0.009 0.012 0.011 0.021 0.014 0.011 0.008 0.012 ...
## $ Extractable_P_Soil : num 0.051 0.007 0.107 0.002 0.061 0.002 0.002 0.052 0.049 0.002 ...
## $ BG_Enz : num 144.6 54.7 109 136 117.8 ...
## $ NAG_Enz : num 74.4 41.2 36.8 68.5 45.2 ...
## $ LAP_Enz : num 27.7 25.6 29.7 30.8 24.8 ...
## $ N_Enz : num 102.1 66.8 66.5 99.4 70 ...
## $ PHOS_Enz : num 369 284 463 597 411 ...
## $ CN_Enz : num 1.417 0.819 1.638 1.369 1.682 ...
## $ NP_Enz : num 0.276 0.235 0.144 0.167 0.171 ...
## $ CP_Enz : num 0.391 0.193 0.235 0.228 0.287 ...
Having all of these data in different objects isn’t convenient since we want to make comparisons among the different types of data. To do this, we need to make sure that the data in the different objects are aligned row-by-row. To avoid making mistakes downstream, I prefer to combine all of these data into a single dataframe instead of sorting or indexing objects separately. This is easy to do using the ‘join’ functions from the ‘dplyr’ package, especially if we take care to ensure that the columns that we use to cross-index between dataframes are named consistently (although there is a workaround if not). It is much easier than doing this manually given the fact that some objects have twice the number of rows due to two sample types were sequenced for each plot.
# we use 'full_join' here because there are no extra rows in our various dataframes
# but type '?join' in the console to see the many flexible options for joining dataframes
# join the sample and plot information into a single object called 'df'
df <- full_join(samps, plots)
## Joining, by = "PlotID"
summary(df)
## SampleID SampleType PlotID BlockID Treatment
## Length:32 root:16 Min. : 3.00 1:8 control:8
## Class :character soil:16 1st Qu.:14.00 2:8 N :8
## Mode :character Median :25.50 3:8 P :8
## Mean :24.88 4:8 NP :8
## 3rd Qu.:37.00
## Max. :45.00
# join the soil data to 'df' (note that the soil data will be duplicated across the two
# 'SampleTypes' since this is unique to each plot, not each sample that was sequenced)
df <- full_join(df, soils)
## Joining, by = "PlotID"
summary(df)
## SampleID SampleType PlotID BlockID Treatment
## Length:32 root:16 Min. : 3.00 1:8 control:8
## Class :character soil:16 1st Qu.:14.00 2:8 N :8
## Mode :character Median :25.50 3:8 P :8
## Mean :24.88 4:8 NP :8
## 3rd Qu.:37.00
## Max. :45.00
## Microbial_Biomass_C Microbial_N Microbial_P Total_C_mg_g_soil
## Min. :1.855 Min. :0.0920 Min. :0.05300 Min. : 56.92
## 1st Qu.:1.979 1st Qu.:0.1217 1st Qu.:0.06175 1st Qu.:108.47
## Median :2.151 Median :0.1490 Median :0.10700 Median :211.53
## Mean :2.187 Mean :0.1493 Mean :0.14056 Mean :239.70
## 3rd Qu.:2.352 3rd Qu.:0.1653 3rd Qu.:0.18750 3rd Qu.:395.49
## Max. :2.743 Max. :0.2570 Max. :0.34900 Max. :439.34
## Total_N_mg_g_soil Total_P_mg_g_soil X..Moisture DOC_Soil
## Min. : 3.700 Min. : 2508 Min. :13.55 Min. :0.0810
## 1st Qu.: 7.725 1st Qu.: 3591 1st Qu.:14.48 1st Qu.:0.2015
## Median :13.910 Median : 4363 Median :15.14 Median :0.2305
## Mean :15.695 Mean : 5294 Mean :15.12 Mean :0.2173
## 3rd Qu.:24.140 3rd Qu.: 6565 3rd Qu.:15.73 3rd Qu.:0.2530
## Max. :31.910 Max. :12651 Max. :16.71 Max. :0.3280
## Extractable_N_Soil Extractable_P_Soil BG_Enz NAG_Enz
## Min. :0.00600 Min. :0.00200 Min. : 54.74 Min. : 32.66
## 1st Qu.:0.00775 1st Qu.:0.00200 1st Qu.:104.09 1st Qu.: 36.38
## Median :0.01100 Median :0.01950 Median :120.08 Median : 43.23
## Mean :0.01056 Mean :0.03269 Mean :123.19 Mean : 55.72
## 3rd Qu.:0.01200 3rd Qu.:0.05125 3rd Qu.:137.69 3rd Qu.: 58.54
## Max. :0.02100 Max. :0.10700 Max. :213.97 Max. :130.16
## LAP_Enz N_Enz PHOS_Enz CN_Enz
## Min. :23.31 Min. : 56.97 Min. :284.1 Min. :0.8193
## 1st Qu.:24.75 1st Qu.: 65.45 1st Qu.:346.1 1st Qu.:1.3609
## Median :27.31 Median : 68.71 Median :419.1 Median :1.5719
## Mean :27.10 Mean : 82.82 Mean :462.2 Mean :1.5872
## 3rd Qu.:29.28 3rd Qu.: 88.01 3rd Qu.:548.7 3rd Qu.:1.7357
## Max. :30.84 Max. :157.01 Max. :902.9 Max. :3.1758
## NP_Enz CP_Enz
## Min. :0.1172 Min. :0.1461
## 1st Qu.:0.1453 1st Qu.:0.2236
## Median :0.1691 Median :0.2667
## Mean :0.1901 Mean :0.2792
## 3rd Qu.:0.1945 3rd Qu.:0.3419
## Max. :0.4572 Max. :0.4156
# join the OTU data to 'df', but don't look at a summary of the whole dataframe
# since this will be very large -- instead we can use the square brackets ('[]')
# to look at the structure of the dataframe based on the first few columns
df <- full_join(df, dat)
## Joining, by = "SampleID"
str(df[, 1:25])
## 'data.frame': 32 obs. of 25 variables:
## $ SampleID : chr "JP1" "JP2" "JP3" "JP4" ...
## $ SampleType : Factor w/ 2 levels "root","soil": 2 2 2 2 2 2 2 2 2 2 ...
## $ PlotID : int 44 45 40 42 36 29 33 22 32 5 ...
## $ BlockID : Factor w/ 4 levels "1","2","3","4": 4 4 4 4 3 3 3 3 2 1 ...
## $ Treatment : Factor w/ 4 levels "control","N",..: 4 1 3 2 1 2 3 3 4 3 ...
## $ Microbial_Biomass_C: num 1.99 2.12 2.4 2.19 2.15 ...
## $ Microbial_N : num 0.165 0.194 0.257 0.166 0.174 0.156 0.114 0.126 0.164 0.123 ...
## $ Microbial_P : num 0.069 0.071 0.349 0.198 0.055 0.198 0.143 0.184 0.061 0.331 ...
## $ Total_C_mg_g_soil : num 285.7 297.2 68.7 110.9 393.5 ...
## $ Total_N_mg_g_soil : num 23.68 27.26 7.35 8.42 31.91 ...
## $ Total_P_mg_g_soil : num 7322 3622 4814 7371 6313 ...
## $ X..Moisture : num 14.2 13.6 14.2 15 14.6 ...
## $ DOC_Soil : num 0.259 0.229 0.251 0.235 0.227 0.328 0.202 0.247 0.232 0.305 ...
## $ Extractable_N_Soil : num 0.007 0.014 0.009 0.011 0.006 0.008 0.006 0.011 0.012 0.009 ...
## $ Extractable_P_Soil : num 0.002 0.002 0.101 0.049 0.002 0.049 0.032 0.052 0.002 0.107 ...
## $ BG_Enz : num 105.3 100.5 142.7 131.9 99.2 ...
## $ NAG_Enz : num 32.7 39.3 130.2 121.5 35.1 ...
## $ LAP_Enz : num 24.3 27.5 26.8 29.9 23.3 ...
## $ N_Enz : num 57 66.7 157 151.4 58.4 ...
## $ PHOS_Enz : num 294 477 343 903 333 ...
## $ CN_Enz : num 1.848 1.505 0.909 0.871 1.698 ...
## $ NP_Enz : num 0.194 0.14 0.457 0.168 0.175 ...
## $ CP_Enz : num 0.358 0.211 0.416 0.146 0.298 ...
## $ ITSall_OTUa_524 : int 170 10 0 683 1172 20 33 2701 44 4990 ...
## $ ITSall_OTUa_78 : int 7 1091 1097 594 101 349 248 329 19 803 ...
# now that we have joined the data, we can convert 'PlotID' to factor
df$PlotID <- factor(df$PlotID)
# keep the workspace tidy by removing unnecessary objects
rm(dat, plots, samps, soils)
It is easy extract certain types of variables later on when necessary because I have been careful in naming my variables (more on this later). It also enhances reproducability because I am always starting from the same dataframe.
The blast results are stored in a tab-delimited test file, seen below.
The first two commands in the code below read the file into R and renames the first column. The third command keeps only the first row of the result for each OTU, this is the result with the bet evalue and bitscore. You may have other criteria that you use to select the most appropriate taxonomic annotation, requiring some modifications to this code.
# read in the data, use 'stringsAsFactors=F' to keep text as 'character'
tax <- read.delim('data/nutnet_tax.txt', stringsAsFactors=F)
# rename the first column
names(tax)[1] <- 'OTUId'
# keep only the first row of each OTU
tax <- tax[!duplicated(tax$OTUId), ]
Your taxonomic annotations may not yet be parsed into the different taxonomic levels, this can be done using the following code.
# split the 'taxonomy' column in 'tax' by the delimiter
temp <- strsplit(tax$taxonomy, ';')
# collapse into a matrix, naming each column
temp <- do.call('rbind', temp)
colnames(temp) <- c('kingdom', 'phylum', 'class', 'order', 'family', 'genus', 'species')
# any kingdom-to-genus unclassified, change to NA
temp[temp %in% grep('unclassified', temp, value=T)] <- NA
# any species unclassified, change to NA
temp[temp %in% grep('_sp$', temp, value=T)] <- NA
# clean up characters at the start of each remaining annotation
temp <- gsub('^[a-z]\\_\\_', '', temp)
# add new columns onto 'tax' data.frame
tax <- cbind(tax, temp, stringsAsFactors=F)
# clean up workspace
rm(temp)
The phyloseq library contains several helper functions for generating many of the summary plots that we might like to use to inspect the data. The first step is to split the data into its various components and then to merge into a phyloseq object.
# load library
library(phyloseq)
# extract the OTU data from 'df' and convert to matrix format
# using 'starts_with' with 'select' allows us to chose all columns containing OTU counts
otus <- as.matrix(select(df, starts_with('ITSall')))
# rows need to be named with the sample names from 'df',
# which we can do directly because they are in the same order
rownames(otus) <- df$SampleID
# extract the sample data from from 'df' and keep as dataframe format (mix of character and numeric variables)
samps <- select(df, SampleID:CP_Enz)
# rows need to be named with the sample names from 'df',
# which we can do directly because they are in the same order
rownames(samps) <- df$SampleID
# extract the taxonomy info for each OTU from 'tax' and convert to matrix format
taxonomy <- as.matrix(select(tax, kingdom:species))
# rows need to be named with the OTU names from 'tax',
# which we can do directly because they are in the same order
rownames(taxonomy) <- tax$OTUId
# merge the three objects into a single phyloseq object using 'merge_phyloseq'
# each function nexted within the call to 'merge_phyloseq' creates a special object for that type of data
# because of how the OTU matrix is oriented, we have to specify 'taxa_are_rows=F'
phy <- merge_phyloseq(otu_table(otus, taxa_are_rows=F),
sample_data(samps), tax_table(taxonomy))
# remove extra objects to keep workspace tidy
rm(otus, samps, taxonomy)
We can produce some barplots to look at the distribution of taxa in our data. plot_bar works with a phyloseq object, and we can do further modification of the plot with ggplot2 functions.
# load library
library(ggplot2)
# bar plot of phylum by treatment
plot_bar(phy, x='Treatment', fill='phylum') +
# create separate plots for each sample type
facet_wrap(~SampleType) +
# add this line to suppress OTU counts
geom_bar(aes(color=phylum), stat='identity', position='stack')
There are a lot of OTUs that have not been classified at the phylum level, we may want to exclude these. There are some others that have been classified to nonfungal taxa and we will definitely want to exclude these. The code below overwrites the existing phyloseq object after removing the OTUs that are nonfungal and unclassified at the phylum level.
# keep only fungal OTUs with at least phylum-level classification
phy <- subset_taxa(phy, kingdom=='Fungi' & !is.na(phylum))
# bar plot of phylum by treatment
plot_bar(phy, x='Treatment', fill='phylum') +
# create separate plots for each sample type
facet_wrap(~SampleType) +
# add this line to suppress OTU counts
geom_bar(aes(color=phylum), stat='identity', position='stack')
Note: if your annotations say unclassified <taxon> instead of NA, the following should work:
# keep only fungal OTUs with at least phylum-level classification
# the 'phylum %in% grep' call returns 'TRUE' if unclassified, and
# the '!' forces it to return the opposite ('FALSE')
phy <- subset_taxa(phy, kingdom == 'Fungi' &
!phylum %in% grep('unclassified', phylum, value=T))
Also note that you can modify this code to only keep certain taxa if you want to restrict your analysis to a narrow subset.
We can also produce a krona plot that we can open in a web browser and will be more convenient to inspect. The code below produces a krona plot for each level of SampleType and saves the result in an html file in the working directory. This result can be viewed here: <<
library(psadd)
plot_krona(phy, output='nutnet_ITS', variable='SampleType')
We can also produce a network graph to visualise relationships among samples. At this stage, this dataset has very high OTU turnover among samples so we have to set a very high max.dist to include all samples.
# make the network object
net <- make_network(phy, 'samples', max.dist=0.9)
# plot the network object, setting symbol colour and shape
plot_network(net, phy, color='SampleType', shape='Treatment')
## Warning: attributes are not identical across measure variables; they will
## be dropped
Finally, let’s plot an ordination to visualise relationships in multivariate space. More on the speific methods used in this ordination below.
# set up a two-panel plot
par(mfrow=c(1, 2))
# plot the ordination using multidimensional scaling of Bray-Curtis distances
plot_ordination(phy, ordinate(phy, method='PCoA', distance='bray'),
color='SampleType', shape='Treatment', label='SampleID')
# plot a scree plot showing variation explained by each axis
plot_ordination(phy, ordinate(phy, method='PCoA', distance='bray'),
type='scree')